无监督的域适应性(UDA)引起了相当大的关注,这将知识从富含标签的源域转移到相关但未标记的目标域。减少域间差异一直是提高UDA性能的关键因素,尤其是对于源域和目标域之间存在较大差距的任务。为此,我们提出了一种新颖的风格感知功能融合方法(SAFF),以弥合大域间隙和转移知识,同时减轻阶级歧视性信息的丧失。受到人类传递推理和学习能力的启发,研究了一种新颖的风格感知的自我互化领域(SSID),通过一系列中级辅助综合概念将两个看似无关的概念联系起来。具体而言,我们提出了一种新颖的SSID学习策略,该策略从源和目标域中选择样本作为锚点,然后随机融合这些锚的对象和样式特征,以生成具有标记和样式丰富的中级辅助功能以进行知识转移。此外,我们设计了一个外部存储库来存储和更新指定的标记功能,以获得稳定的类功能和班级样式功能。基于提议的内存库,内部和域间损耗功能旨在提高类识别能力和特征兼容性。同时,我们通过无限抽样模拟SSID的丰富潜在特征空间,并通过数学理论模拟损失函数的收敛性。最后,我们对常用的域自适应基准测试进行了全面的实验,以评估所提出的SAFF,并且实验结果表明,所提出的SAFF可以轻松地与不同的骨干网络结合在一起,并获得更好的性能作为插入插型模块。
translated by 谷歌翻译
Hand and face play an important role in expressing sign language. Their features are usually especially leveraged to improve system performance. However, to effectively extract visual representations and capture trajectories for hands and face, previous methods always come at high computations with increased training complexity. They usually employ extra heavy pose-estimation networks to locate human body keypoints or rely on additional pre-extracted heatmaps for supervision. To relieve this problem, we propose a self-emphasizing network (SEN) to emphasize informative spatial regions in a self-motivated way, with few extra computations and without additional expensive supervision. Specifically, SEN first employs a lightweight subnetwork to incorporate local spatial-temporal features to identify informative regions, and then dynamically augment original features via attention maps. It's also observed that not all frames contribute equally to recognition. We present a temporal self-emphasizing module to adaptively emphasize those discriminative frames and suppress redundant ones. A comprehensive comparison with previous methods equipped with hand and face features demonstrates the superiority of our method, even though they always require huge computations and rely on expensive extra supervision. Remarkably, with few extra computations, SEN achieves new state-of-the-art accuracy on four large-scale datasets, PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. Visualizations verify the effects of SEN on emphasizing informative spatial and temporal features. Code is available at https://github.com/hulianyuyy/SEN_CSLR
translated by 谷歌翻译
Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to access the statistical significance of a set of categorical clusters remains unaddressed. To fulfill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method is able to achieve comparable performance to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation on statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
translated by 谷歌翻译
对于基于骨架的动作识别中的当前方法通常是将长期时间依赖性作为骨骼序列捕获通常长的(> 128帧),这很常见,这对于先前的方法构成了一个具有挑战性的问题。在这种情况下,短期依赖性很少被正式考虑,这对于对类似动作进行分类至关重要。大多数当前的方法包括相互交织的仅空间模块和仅时间的模块,在这些模块中,在相邻框架中的关节之间的直接信息流受到阻碍,因此不如捕获短期运动并区分相似的动作对。为了应对这一限制,我们提出了一个作为stgat创造的一般框架,以建模跨天空信息流。它使仅空间模块与区域感知的时空建模相称。尽管STGAT在理论上对时空建模具有有效性,但我们提出了三个简单的模块,以减少局部时空特征冗余,并进一步释放STGAT的潜力,(1)(1)自我关注机制的范围,(2)动态重量的范围(2)沿时间尺寸的关节和(3)分别与静态特征分开的微妙运动。作为一个可靠的特征提取器,STGAT在对以前的方法进行分类时,在定性和定量结果中都证明了相似的动作。 STGAT在三个大规模数据集上实现了最先进的性能:NTU RGB+D 60,NTU RGB+D 120和动力学骨架400。释放了代码。
translated by 谷歌翻译
合并方法是现代神经网络增加接受场并降低计算成本的必要性。但是,通常使用的手工制作的合并方法,例如,最大池和平均合并,可能无法保持判别特征。尽管许多研究人员在空间域中精心设计了各种汇集变体,以便在这些局限性方面处理这些局限性,但很少访问直接使用手工制作的方法,或者这些专业的空间变体可能不是最佳的。在本文中,我们从信号处理中的提升方案中得出了时间升降机池(TLP),以智能地逐步划分不同的时间层次结构。提升方案将输入信号分配到具有不同频率的各种子兰,这可以看作是不同的时间运动模式。我们的TLP是一个三阶段的过程,它执行信号分解,组件加权和信息融合以生成精致尺寸的特征图。我们选择具有长序列的典型时间任务,即连续的手语识别(CSLR)作为验证TLP的有效性的测试台。两个大规模数据集的实验表明,TLP的表现优于手工制作的方法和专门的空间变体,其较大的边距(1.5%)具有相似的计算开销。作为功​​能强大的功能提取器,TLP在各种数据集上的多个骨干上表现出很大的概括性,并在两个大规模的CSLR数据集上实现了新的最新结果。可视化进一步证明了TLP在校正光泽边界中的机制。代码已发布。
translated by 谷歌翻译
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
translated by 谷歌翻译
Knowledge graphs (KG) have served as the key component of various natural language processing applications. Commonsense knowledge graphs (CKG) are a special type of KG, where entities and relations are composed of free-form text. However, previous works in KG completion and CKG completion suffer from long-tail relations and newly-added relations which do not have many know triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of graph representation learning and few-shot learning, has been proposed to challenge the problem of limited annotated data. In this paper, we comprehensively survey previous attempts on such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KGs and the methods. Finally, we present applications of FKGC models on prediction tasks in different areas and share our thoughts on future research directions of FKGC.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs are with a large number of parameters, which makes these GNNs computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a light-weighted model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborates that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
translated by 谷歌翻译
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading-off model accuracy and constrained resources still need further improvements. This work rethinks the essential unity of efficient Inverted Residual Block in MobileNetv2 and effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance though sharing the same framework. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Massive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 that surpass \textbf{SoTA} CNN-/Transformer-based models, while trading-off the model accuracy and efficiency well.
translated by 谷歌翻译